Skip to content

libwholegraph wheels: use nvidia-nccl wheels instead of vendoring libnccl.so#284

Merged
jameslamb merged 2 commits intorapidsai:branch-25.10from
jameslamb:devendor-libnccl
Aug 26, 2025
Merged

libwholegraph wheels: use nvidia-nccl wheels instead of vendoring libnccl.so#284
jameslamb merged 2 commits intorapidsai:branch-25.10from
jameslamb:devendor-libnccl

Conversation

@jameslamb
Copy link
Copy Markdown
Member

@jameslamb jameslamb commented Aug 25, 2025

Fixes #281

Similar to rapidsai/cuvs#827, proposes that wheels get their copy of libnccl.so at runtime from nvidia-nccl-cu{12,13} wheels, instead of vendoring a copying inside libwholegraph.

Notes for Reviewers

Benefits of these changes

  • I think this fill fix the nightly tests
    • @alexbarghi-nv investigated offline and together with some other people found that "competing libnccl between this project and pytorch" may be to blame for the failing nightly tests
  • smaller libwholegraph wheels
    • 120 MiB smaller!
    • now they can go on pypi.org without needing an exception 🎉 🎉 🎉
  • smaller disk footprint for environments
    • via avoiding multiple copies of libnccl.so
  • reduced risks of runtime bugs caused by multiple competing libnccl.so being loaded

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Aug 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@jameslamb
Copy link
Copy Markdown
Member Author

/ok to test

@jameslamb jameslamb added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Aug 25, 2025
@jameslamb jameslamb changed the title WIP: libwholegraph wheels: use nvidia-nccl wheels instead of vendoring libnccl.so libwholegraph wheels: use nvidia-nccl wheels instead of vendoring libnccl.so Aug 25, 2025
@jameslamb
Copy link
Copy Markdown
Member Author

I think this is ready for review!

cugraph-pyg wheel tests are failing like this:

/pyenv/versions/3.10.18/lib/python3.10/site-packages/pylibcudf/__init__.py:13: in <module>
    from . import (
E   ImportError: libcudf.so: cannot open shared object file: No such file or director

(build link)

But probably because of the ongoing libcudf.so-loading issues that I think should be fixed by rapidsai/cudf#19786

echo "libwholegraph-${RAPIDS_PY_CUDA_SUFFIX} @ file://$(echo "${LIBWHOLEGRAPH_WHEELHOUSE}"/libwholegraph_*.whl)" >> "${PIP_CONSTRAINT}"

export SKBUILD_CMAKE_ARGS="-DBUILD_SHARED_LIBS=ON;-DCMAKE_MESSAGE_LOG_LEVEL=VERBOSE;-DCUDA_STATIC_RUNTIME=ON;-DWHOLEGRAPH_BUILD_WHEELS=ON"
export SKBUILD_CMAKE_ARGS="-DBUILD_SHARED_LIBS=ON;-DCMAKE_MESSAGE_LOG_LEVEL=VERBOSE;-DCUDA_STATIC_RUNTIME=ON"
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly related, but noticed while I was looking at logs to confirm the NCCL stuff was working correctly.

  CMake Warning:
    Manually-specified variables were not used by the project:

      CUDA_STATIC_RUNTIME
      WHOLEGRAPH_BUILD_WHEELS

(build link)

Think we still want CUDA_STATIC_RUNTIME because it could get passed through here:

add_subdirectory(pylibwholegraph/binding)

But WHOLEGRAPH_BUILD_WHEELS doesn't do anything anywhere in this project.

git grep '_BUILD_WHEEL'

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@jameslamb
Copy link
Copy Markdown
Member Author

/ok to test

@jameslamb jameslamb marked this pull request as ready for review August 25, 2025 20:37
@jameslamb jameslamb requested review from a team as code owners August 25, 2025 20:37
@jameslamb
Copy link
Copy Markdown
Member Author

This will need an admin merge because of the failing check-nightly-ci test...I can do that once this has approvals.

Copy link
Copy Markdown
Member

@alexbarghi-nv alexbarghi-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alexbarghi-nv
Copy link
Copy Markdown
Member

@jameslamb you have my permission to admin merge once we have packaging/CI approval

@jameslamb
Copy link
Copy Markdown
Member Author

Thank you!

Copy link
Copy Markdown
Contributor

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice improvement, let's gooooo!


# PyPI limit is 100 MiB, fail CI before we get too close to that
max_allowed_size_compressed = '75M'
max_allowed_size_compressed = '10Mi'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooooh yeaahhh

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥳

Comment on lines +62 to +63
# PyPI limit is 100 MiB, fail CI before we get too close to that
max_allowed_size_compressed = '80Mi'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even better!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤘🏻

@jameslamb
Copy link
Copy Markdown
Member Author

Thanks @gforsyth ! Ok I'm going to admin-merge this and then manually trigger the nightly tests.

@jameslamb jameslamb merged commit be71c89 into rapidsai:branch-25.10 Aug 26, 2025
48 of 50 checks passed
@jameslamb jameslamb deleted the devendor-libnccl branch August 26, 2025 20:25
@jameslamb
Copy link
Copy Markdown
Member Author

All of the wheel tests passed on the manually-triggered nightly run!

https://github.com/rapidsai/cugraph-gnn/actions/runs/17250350223

Just one conda test failed, and it looks to me like something temporary while downloading a large dataset. I just restarted it... I'm hopeful it'll pass and that nightlies will be working here again.

rapids-bot bot pushed a commit that referenced this pull request Mar 10, 2026
#284 devendored the `libnccl.so` from `libwholegraph` wheels.

That wasn't extended to cover CUDA 13 wheels when added CUDA 13 support, so `libwholegraph-cu13` wheels have been depending on getting this transitively from `raft` and `cugraph` at runtime 😬 

This fixes that.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #428
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

libwholegraph: stop vendoring NCCL in wheels

4 participants